3 research outputs found
Document Collection Visualization and Clustering Using An Atom Metaphor for Display and Interaction
Visual data mining has proven to be of high value in exploratory data analysis and data mining because it provides intuitive feedback on data analysis and supports decision-making activities. Several visualization techniques have been developed for cluster discovery, such as Grand Tour, HD-Eye, and Star Coordinates. They are very useful tools that visualize data in 2D or 3D; however, they are not simple for untrained users. This thesis proposes a new approach to building a 3D clustering visualization system for document clustering using the k-means algorithm. Each cluster is represented by a neutron (the centroid) and electrons (the documents), which are kept at a distance from the neutron by a force. Our approach employs quantified domain knowledge and explorative observation as prediction to map high-dimensional data onto 3D space, revealing the relationships among documents. Users can perform an intuitive visual assessment of the consistency of the cluster structure.
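The clustering step behind the atom metaphor can be illustrated with a minimal sketch. It assumes documents are already encoded as equal-length numeric vectors and uses plain k-means (Lloyd's algorithm); the names `kmeans` and `orbit_radii`, the deterministic seeding, and the toy vectors are all hypothetical. The thesis's force model and its mapping onto 3D space are not described in the abstract, so they are not reproduced here; the "orbit radius" is simply the Euclidean distance from a document to its centroid.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on equal-length numeric vectors."""
    # Assumption: seed centroids with the first k points so the run is deterministic.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def orbit_radii(points, centroids, labels):
    """Distance of each document (electron) from its centroid (neutron)."""
    return [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]


# Toy document vectors (hypothetical term weights over three terms).
docs = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
]
centroids, labels = kmeans(docs, k=2)
radii = orbit_radii(docs, centroids, labels)
```

In this sketch a small radius means a document sits close to its cluster's centre, which is the kind of consistency a viewer would judge visually in the proposed system.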
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
We present The Vault, an open-source, large-scale code-text dataset designed
to enhance the training of code-focused large language models (LLMs). Existing
open-source datasets for training code-based LLMs often face challenges in
terms of size, quality (due to noisy signals), and format (only containing code
function and text explanation pairings). The Vault overcomes these limitations
by providing 40 million code-text pairs across 10 popular programming
languages, thorough cleaning for 10+ prevalent issues, and various levels of
code-text pairings, including class, function, and line levels. Researchers and
practitioners can utilize The Vault for training diverse code-focused LLMs or
incorporate the provided data cleaning methods and scripts to improve their
datasets. By employing The Vault as the training dataset for code-centric LLMs,
we anticipate significant advancements in code understanding and generation
tasks, fostering progress in both artificial intelligence research and software
development practices.
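To make the idea of a function-level code-text pair concrete, here is a minimal sketch for Python source only. It is not The Vault's extraction pipeline: it assumes the "text" side is a function's docstring and the "code" side is the function's source, and it uses only the standard-library `ast` module. The helper name `function_level_pairs` and the sample snippet are hypothetical.

```python
import ast


def function_level_pairs(source):
    """Return (docstring, code) pairs for every documented function in `source`."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # functions without a docstring have no text side, so skip them
                pairs.append((doc, ast.get_source_segment(source, node)))
    return pairs


sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''
pairs = function_level_pairs(sample)
```

Class-level and line-level pairings would need different extraction rules (for example, class docstrings or inline comments), which this sketch does not attempt.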